
    Exploring concept representations for concept drift detection

    We present an approach to estimating concept drift in online news. Our method is to construct temporal concept vectors from topic-annotated news articles, and to correlate the distance between the temporal concept vectors with edits to the Wikipedia entries of the concepts. The correlation improves when we split the news articles by the number of articles mentioning a concept rather than by calendar-based units of time.
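The count-based splitting described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each article is reduced to a list of topic annotations, uses a plain annotation-count vector per bin, and measures drift as the cosine distance between consecutive bins. The function names, the vector construction, and the choice of Pearson correlation are all assumptions made for illustration.

```python
import math
from collections import Counter


def split_by_count(articles, size):
    """Split chronologically ordered articles into bins of `size` articles.

    The final bin may hold fewer than `size` articles.
    """
    return [articles[i:i + size] for i in range(0, len(articles), size)]


def concept_vector(bin_articles):
    """Hypothetical concept vector: annotation counts over one bin."""
    counts = Counter()
    for annotations in bin_articles:
        counts.update(annotations)
    return counts


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def drift_series(articles, size):
    """Distances between the concept vectors of consecutive bins."""
    vectors = [concept_vector(b) for b in split_by_count(articles, size)]
    return [cosine_distance(u, v) for u, v in zip(vectors, vectors[1:])]


def pearson(xs, ys):
    """Pearson correlation, e.g. between drift and Wikipedia edit counts."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Under these assumptions, `drift_series` yields one distance per pair of adjacent bins, and `pearson` can compare that series with a series of edit counts aligned to the same bins.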

    Bias in the analysis of multilingual legislative speech

    In this paper we investigate the application of natural language processing tools to the multilingual proceedings of the European Parliament. This work is part of a study in which we explore (1) how subcorpora in different languages may lead to different conclusions about the political landscape, (2) how to determine where a potential language-related bias originates, and (3) to what extent we can limit or even prevent an unwanted language bias.

    A corpus of images and text in online news

    In recent years, several datasets have been released that include images and text, giving impulse to new methods that combine natural language processing and computer vision. However, there is a need for datasets of images in their natural textual context. The ION corpus contains 300K news articles published between August 2014 and August 2015 in five online newspapers from two countries. The one-year coverage over multiple publishers ensures a broad scope in terms of topics, image quality and editorial viewpoints. The corpus consists of JSON-LD files with the following data about each article: the original URL of the article on the news publisher’s website, the date of publication, the headline of the article, the URL of the image displayed with the article (if any), and the caption of that image. Neither the article text nor the images themselves are included in the corpus. Instead, the images are distributed as high-dimensional feature vectors extracted from a Convolutional Neural Network, anticipating their use in computer vision tasks. The article text is represented as a list of automatically generated entity and topic annotations in the form of Wikipedia/DBpedia pages. This facilitates the selection of subsets of the corpus for separate analysis or evaluation.
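As an illustration of the per-article data listed above, the following sketch parses one JSON-LD-style record into a typed object. The key names (`url`, `datePublished`, `headline`, `image`, `caption`) and the sample values are hypothetical, chosen only to mirror the fields named in the abstract; the actual ION schema may differ.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArticleRecord:
    url: str                   # original article URL on the publisher's site
    date_published: str        # publication date, kept as a string
    headline: str
    image_url: Optional[str]   # None when the article has no image
    caption: Optional[str]     # caption of that image, if any


def parse_record(text: str) -> ArticleRecord:
    """Parse one record; all key names here are assumed, not documented."""
    data = json.loads(text)
    image = data.get("image") or {}
    return ArticleRecord(
        url=data["url"],
        date_published=data["datePublished"],
        headline=data["headline"],
        image_url=image.get("url"),
        caption=image.get("caption"),
    )


# Made-up sample record for illustration only.
SAMPLE = """
{
  "url": "https://example.org/news/article-1",
  "datePublished": "2014-09-01",
  "headline": "Example headline",
  "image": {"url": "https://example.org/img/1.jpg", "caption": "Example caption"}
}
"""

record = parse_record(SAMPLE)
```

Treating the image as optional reflects the "(if any)" qualifier in the description: records without an image yield `None` for both image fields.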

    SWISH DataLab: A Web Interface for Data Exploration and Analysis

    SWISH DataLab is a single integrated collaborative environment for data processing, exploration and analysis that combines Prolog and R. The web interface makes it possible to share the data, the code of all processing steps and the results among researchers, and a versioning system facilitates reproducibility of the research at any chosen point. Using search logs from the National Library of the Netherlands combined with the collection content metadata, we demonstrate how to use SWISH DataLab for all stages of data analysis, using Prolog predicates, graph visualizations, and R.

    Interchanging lexical resources on the Semantic Web

    Lexica and terminology databases play a vital role in many NLP applications, but currently most such resources are published in application-specific formats, or with custom access interfaces, so that much of this data sits in "data silos" and is difficult to access. The Semantic Web and in particular the Linked Data initiative provide effective solutions to this problem, as well as possibilities for data reuse by inter-lexicon linking, and incorporation of data categories by dereferenceable URIs. The Semantic Web focuses on the use of ontologies to describe semantics on the Web, but currently there is no standard for providing complex lexical information for such ontologies and for describing the relationship between the lexicon and the ontology. We present our model, lemon, which aims to address these gaps.

    A larger-scale evaluation resource of terms and their shift direction for diachronic lexical semantics

    Determining how words have changed their meaning is an important topic in Natural Language Processing. However, evaluations of methods to characterise such change have been limited to small, handcrafted resources. We introduce an English evaluation set which is larger, more varied, and more realistic than seen to date, with terms derived from a historical thesaurus. Moreover, the dataset is unique in that it represents change as a shift from the term of interest to a WordNet synset. Using the synset lemmas, we can use this set to evaluate (standard) methods that detect change between word pairs, as well as (adapted) methods that detect the change between a term and a sense overall. We show that performance on the new dataset is much lower than earlier reported findings, setting a new standard.